Project: Predict Graduation Rates (2019 U.S. College Data)

Author: Robert Zacchigna

Table of Contents

Introduction

The problem I will be addressing is how factors such as whether the University is Private, number of application, number of students accepted, number of enrollments, class rank in High school, out of state cost, room and board, Book cost, professors with PhD’s, student to faculty ratio, and expenditure per student can predict a university's student graduation rate.

University administrators would be interested to know what factors impact a student graduating from their university and how they could possibly increase the amount that graduate.

Research Questions

  • Which variables have the biggest significant impact on the graduation rate of a university?
  • Does an increase in professors with PhD’s have an effect on the student’s graduation rate?
  • Does a university being private or public have an effect on student graduation rate?
  • What is the distribution of universities with graduation rates higher than 30%, 50%, 70%, 90%?
  • Which model (linear, randomForest, clustering, etc.) is the most accurate to predict the graduation rate given the variables mentioned in the introduction?

Addressing Problem Statement

This approach will determine which of the above variables have a significant impact, create multiple models to determine the graduation rates of universities, and calculate accuracy of the models to determine the best one to use.

Dataset - U.S. College Data (2019)

Download Location: https://www.kaggle.com/yashgpt/us-college-data

Columns:

  • Private – Categorical variable of Yes and No indicating private or public university (Yes = Private, No = Public)
  • Apps – Number of applications received
  • Accept – Number of applications accepted
  • Enroll – Number of new students enrolled
  • Top10perc - Percentage of new students from top 10% of H.S. class
  • Top25perc - Percentage of new students from top 25% of H.S. class
  • F.Undergrad - Number of fulltime undergraduates
  • P.Undergrad - Number of parttime undergraduates
  • Outstate - Out-of-state tuition
  • Room.Board – Room and Board costs (U.S. Dollars)
  • Books – Estimated Book costs (U.S. Dollars)
  • Personal – Estimated personal spending (U.S. Dollars)
  • PhD – Percentage of faculty with Ph.D.’s
  • Terminal - Percentage of faculty with terminal degree
  • S.F.Ratio - Student/Faculty Ratio
  • perc.alumni - Percentage alumni who donate
  • Expend – Instructional expenditure per student (U.S. Dollars)
  • Grad.Rate – Graduation rate (Percentage)

Libraries

In [2]:
library(e1071)
library(dplyr)
library(caret)
library(purrr)
library(ggplot2)
library(reshape2)
library(gridExtra)
library(tidyverse)
library(randomForest)

Part 1: Exploratory Data Analysis

Import and Clean Dataset

In [13]:
college_data <- read.csv("College_Data.csv")

head(college_data)
A data.frame: 6 × 19
Uni.NamePrivateAppsAcceptEnrollTop10percTop25percF.UndergradP.UndergradOutstateRoom.BoardBooksPersonalPhDTerminalS.F.Ratioperc.alumniExpendGrad.Rate
<chr><chr><int><int><int><int><int><int><int><int><int><int><int><int><int><dbl><int><int><int>
1Abilene Christian UniversityYes1660123272123522885 537 744033004502200707818.112 704160
2Adelphi University Yes218619245121629268312271228064507501500293012.2161052756
3Adrian College Yes1428109733622501036 991125037504001165536612.930 873554
4Agnes Scott College Yes 417 3491376089 510 63129605450450 8759297 7.7371901659
5Alaska Pacific University Yes 193 146 551644 249 869 756041208001500767211.9 21092215
6Albertson College Yes 587 4791583862 678 41135003335500 6756773 9.411 972755

Convert "Yes" (Private) and "No" (Public) to Numbers, Yes = 2, No = 1

In [14]:
college_data$Private <- as.integer(ifelse(college_data$Private=="No", 1, 2))

Round S.F.Ratio and Convert to Integer

In [15]:
# Round S.F.Ratio and convert to integer
college_data$S.F.Ratio <- round(college_data$S.F.Ratio)
college_data$S.F.Ratio <- as.integer(college_data$S.F.Ratio)

Rename Rows to University Names and Remove Uni.Name Column

In [18]:
rownames(college_data) <- college_data$Uni.Name
college_data <- within(college_data, rm(Uni.Name))

Plot Histograms of Variables to See Outliers

In [33]:
options(repr.plot.width=16, repr.plot.height=12)
par(mfrow=c(2,3))

for(i in colnames(college_data)){
    variable = as.numeric(unlist(college_data[i]))

    collegeHists <- hist(variable, main = paste(i, "Histogram", sep=" "), xlab = i)
    text(collegeHists$mids, collegeHists$counts, labels=collegeHists$counts, adj=c(0.5, -0.5))

    xfit <- seq(min(variable), max(variable), length = nrow(college_data))
    yfit <- dnorm(xfit, mean = mean(variable), sd = sd(variable))
    yfit <- yfit * diff(collegeHists$mids[1:2]) * length(variable) 

    lines(xfit, yfit, col = "Blue", lwd = 2)
}

Filter Dataset by Three Standard Deviations Away From the Mean

Use data that is at most three standard deviations away from the mean of that particular variable in both the positive and negative directions.

Find all Standard Deviations Away From the Mean

In [34]:
# Initialize data frame
Value_MinMax <- as.data.frame(cbind("temp" = c(0, 1)))
    
# Find all standard deviations that are three times away 
# from the mean in the positive and negative direction
for(i in colnames(college_data)){
  variable = as.numeric(unlist(college_data[i]))

  Value_MinMax <- cbind(Value_MinMax, 
                        i = c(round(mean(as.numeric(variable)) + sd(variable) * 3),
                              round(mean(as.numeric(variable)) - sd(variable) * 3)))
}

# Remove temp column for data frame initialization
Value_MinMax <- subset(Value_MinMax, select=-c(temp))

# Apply proper column names for the variables
colnames(Value_MinMax) <- colnames(college_data)

# Apply row names
rownames(Value_MinMax) <- c("Max Accepted Value", "Min Accepted Value")

Value_MinMax
A data.frame: 2 × 18
PrivateAppsAcceptEnrollTop10percTop25percF.UndergradP.UndergradOutstateRoom.BoardBooksPersonalPhDTerminalS.F.Ratioperc.alumniExpendGrad.Rate
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
Max Accepted Value314612 9372 3568 80115 18251 54232251076481045337212212426 6025325117
Min Accepted Value0-8609-5335-2008-25 -4-10851-3712-16281067 54-691 24 36 2-14-6005 14

Apply Filter

In [35]:
clean_data <- college_data %>% rownames_to_column('gene') %>% 
      filter(Private %in% (Value_MinMax[[1]][1]:Value_MinMax[[1]][2]),
             Apps %in% (Value_MinMax[[2]][1]:Value_MinMax[[2]][2]),
             Accept %in% (Value_MinMax[[3]][1]:Value_MinMax[[3]][2]),
             Enroll %in% (Value_MinMax[[4]][1]:Value_MinMax[[4]][2]),
             Top10perc %in% (Value_MinMax[[5]][1]:Value_MinMax[[5]][2]),
             Top25perc %in% (Value_MinMax[[6]][1]:Value_MinMax[[6]][2]),
             F.Undergrad %in% (Value_MinMax[[7]][1]:Value_MinMax[[7]][2]),
             P.Undergrad %in% (Value_MinMax[[8]][1]:Value_MinMax[[8]][2]),
             Outstate %in% (Value_MinMax[[9]][1]:Value_MinMax[[9]][2]),
             Room.Board %in% (Value_MinMax[[10]][1]:Value_MinMax[[10]][2]),
             Books %in% (Value_MinMax[[11]][1]:Value_MinMax[[11]][2]),
             Personal %in% (Value_MinMax[[12]][1]:Value_MinMax[[12]][2]),
             PhD %in% (Value_MinMax[[13]][1]:Value_MinMax[[13]][2]),
             Terminal %in% (Value_MinMax[[14]][1]:Value_MinMax[[14]][2]),
             S.F.Ratio %in% (Value_MinMax[[15]][1]:Value_MinMax[[15]][2]),
             perc.alumni %in% (Value_MinMax[[16]][1]:Value_MinMax[[16]][2]),
             Expend %in% (Value_MinMax[[17]][1]:Value_MinMax[[17]][2]),
             Grad.Rate %in% (Value_MinMax[[18]][1]:Value_MinMax[[18]][2])) %>% 
      column_to_rownames('gene')

Clean Dataset Structure

In [49]:
# Structure of cleaned dataset
paste0('Cleaned Dataset Structure')
str(clean_data)

# Cleaned Dataset
paste0('Cleaned Dataset Head')
head(clean_data)
'Cleaned Dataset Structure'
'data.frame':	685 obs. of  18 variables:
 $ Private    : int  2 2 2 2 2 2 2 2 2 2 ...
 $ Apps       : int  1660 1428 417 193 587 353 1899 1038 582 1732 ...
 $ Accept     : int  1232 1097 349 146 479 340 1720 839 498 1425 ...
 $ Enroll     : int  721 336 137 55 158 103 489 227 172 472 ...
 $ Top10perc  : int  23 22 60 16 38 17 37 30 21 37 ...
 $ Top25perc  : int  52 50 89 44 62 45 68 63 44 75 ...
 $ F.Undergrad: int  2885 1036 510 249 678 416 1594 973 799 1830 ...
 $ P.Undergrad: int  537 99 63 869 41 230 32 306 78 110 ...
 $ Outstate   : int  7440 11250 12960 7560 13500 13290 13868 15595 10468 16548 ...
 $ Room.Board : int  3300 3750 5450 4120 3335 5720 4826 4400 3380 5406 ...
 $ Books      : int  450 400 450 800 500 500 450 300 660 500 ...
 $ Personal   : int  2200 1165 875 1500 675 1500 850 500 1800 600 ...
 $ PhD        : int  70 53 92 76 67 90 89 79 40 82 ...
 $ Terminal   : int  78 66 97 72 73 93 100 84 41 88 ...
 $ S.F.Ratio  : int  18 13 8 12 9 12 14 11 12 11 ...
 $ perc.alumni: int  12 30 37 2 11 26 37 23 15 31 ...
 $ Expend     : int  7041 8735 19016 10922 9727 8861 11487 11644 8991 10932 ...
 $ Grad.Rate  : int  60 54 59 15 55 63 73 80 52 73 ...
'Cleaned Dataset Head'
A data.frame: 6 × 18
PrivateAppsAcceptEnrollTop10percTop25percF.UndergradP.UndergradOutstateRoom.BoardBooksPersonalPhDTerminalS.F.Ratioperc.alumniExpendGrad.Rate
<int><int><int><int><int><int><int><int><int><int><int><int><int><int><int><int><int><int>
Abilene Christian University21660123272123522885537 74403300450220070781812 704160
Adrian College21428109733622501036 99112503750400116553661330 873554
Agnes Scott College2 417 3491376089 510 63129605450450 8759297 8371901659
Alaska Pacific University2 193 146 551644 249869 756041208001500767212 21092215
Albertson College2 587 4791583862 678 41135003335500 6756773 911 972755
Albertus Magnus College2 353 3401031745 416230132905720500150090931226 886163

Clean Histograms

In [36]:
options(repr.plot.width=16, repr.plot.height=12)
par(mfrow=c(2,3))

for(i in colnames(clean_data)){
    variable = as.numeric(unlist(clean_data[i]))
    
    cleanHists <- hist(variable, main = paste(i, "Histogram", sep=" "), xlab = i)
    text(cleanHists$mids, cleanHists$counts, labels=cleanHists$counts, adj=c(0.5, -0.5))

    xfit <- seq(min(variable), max(variable), length = nrow(clean_data))
    yfit <- dnorm(xfit, mean = mean(variable), sd = sd(variable))
    yfit <- yfit * diff(cleanHists$mids[1:2]) * length(variable) 

    lines(xfit, yfit, col = "Blue", lwd = 2)
}

Compare Original Dataset to Cleaned Dataset

We can also see that of the original 777 records, only 92 were removed, thus leaving us with 685 total records for this projects analysis.

In [44]:
paste0(nrow(college_data) - nrow(clean_data), ' Records Removed')
'92 Records Removed'

Now that we have the final dataset, we can dive deeper into how each of the variables correlate to each other. The best way i thought to do this was to create a correlation heatmap of all of the variables in my dataset to see which ones have the highest correlation with the graduation rate(Grad.Rate)

Correlation Matrix

In [50]:
# Correlation Matrix reordering by distance Function
reorder_cormat <- function(cormat){
  # Use correlation between variables as distance
  dd <- as.dist((1-cormat)/2)
  hc <- hclust(dd)
  cormat <-cormat[hc$order, hc$order]
}
In [51]:
cormat <- round(cor(clean_data), 2)

# Reorder the correlation matrix
cormat <- reorder_cormat(cormat)

# Melt the correlation matrix
melted_cormat <- melt(cormat)
In [72]:
# Create the correlation matrix heatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal() + # minimal theme
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                   size = 15, hjust = 1)) +
  coord_fixed()
In [73]:
options(repr.plot.width=14, repr.plot.height=14)
par(mfrow=c(1,1))

# Add title and correlation values to squares
ggheatmap + labs(title = "Correlation Matrix Heatmap") + 
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
  theme(
    text = element_text(size = 20),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.ticks = element_blank())

Looking at the heatmap we can see that the variable with the highest correlation with Grad.Rate is Outstate at 0.54 and perc.alumni coming in close with 0.48. Neither of these values are good correlation with Grad.Rate and the rest of the variables fair far worse.

Before making any more assumptions i decided to run linear models on Grad.Rate with each of the other variables to determine their R^2 values and calculate the RMSE. These values, combined with the heatmap, will give me a better idea of how the variables relate to Grad.Rate and if there is any significance between them.

Part 2: Model Analysis

R^2 and RMSE

Calculate RMSE and R^2 values and store values into a dataframe for easy plotting

In [59]:
# RMSE calculation Function
RMSE <- function(error) { sqrt(mean(error^2)) }

# Initialize Dataframe to hold R^2 and RMSE values
RRMSE_df <- as.data.frame(rbind("temp" = c(0, 1, 2, 4)))

# All Variable Names
nameList <- names(clean_data)

# Loop to create linear models with each of the variables, with Grad.Rate as the independent variable
for (x in nameList){
  # Creates the models
  model <- lm(substitute(Grad.Rate ~ i, list(i = as.name(x))), clean_data)

  # Appending values to dataframe
  RRMSE_df <- rbind(RRMSE_df, i = c(x, summary(model)$r.squared, 
                                    summary(model)$adj.r.squared,
                                    RMSE(model$residuals)))
}

# Remove "temp" record from dataframe initialization
RRMSE_df <- RRMSE_df[-1,]

# Apply column names to new dataframe
colnames(RRMSE_df) <- c("Name", "R^2", "Adjusted R^2", "RMSE")

# Round each column values to 4 decminal places
RRMSE_df$`R^2` <- round(as.numeric(RRMSE_df$`R^2`), 4)
RRMSE_df$`Adjusted R^2` <- round(as.numeric(RRMSE_df$`Adjusted R^2`), 4)
RRMSE_df$RMSE <- round(as.numeric(RRMSE_df$RMSE), 4)

# Reorder R^2 column from least to greatest value
RRMSE_df <- RRMSE_df[order(RRMSE_df$`R^2`), drop = FALSE, ]

# Apply factor levels to prevent ggplot from reordering variable values
RRMSE_df$Name <- factor(RRMSE_df$Name, levels=unique(as.character(RRMSE_df$Name)))
RRMSE_df <- transform(RRMSE_df, Name=reorder(Name, `R^2`))
Warning message in model.matrix.default(mt, mf, contrasts):
"the response appeared on the right-hand side and was dropped"
Warning message in model.matrix.default(mt, mf, contrasts):
"problem with term 1 in model.matrix: no columns are assigned"

R^2

In [74]:
options(repr.plot.width=14, repr.plot.height=10)
par(mfrow=c(1,1))

# R^2 Graph from Models
ggplot(RRMSE_df, aes(R.2, Name)) + 
  labs(title = "R^2 of each model with Grad.Rate as the Indepedent Variable", 
       x = "R^2", y = "Variable Name") +
  scale_x_continuous(breaks = round(seq(min(RRMSE_df$R.2), max(RRMSE_df$R.2), by = 0.02), 2)) +
  geom_point() + 
  theme(text = element_text(size = 20))

We are looking for the variable with the highest R^2 and from this graph we can see that OutState has the highest R^2 of about 0.29.

This is not a good R^2 value but we do need to keep in mind that these models were only with one variable each and further modeling with multiple variables still needs to be done. This was simply to get an idea of how each of the variables fair on their own against Grad.Rate.

RMSE

In [75]:
options(repr.plot.width=14, repr.plot.height=10)
par(mfrow=c(1,1))

# Reorder RMSE column from least to greatest value
RRMSE_df <- RRMSE_df[order(RRMSE_df$RMSE), drop = FALSE, ]

# Apply factor levels to prevent ggplot from reordering variable values
RRMSE_df$Name <- factor(RRMSE_df$Name, levels=unique(as.character(RRMSE_df$Name)))
RRMSE_df <- transform(RRMSE_df, Name=reorder(Name, RMSE))

# RMSE calculation graph from model$residuals
ggplot(RRMSE_df, aes(RMSE, Name)) + 
  labs(title = "RMSE of each model$residuals with Grad.Rate as the Indepedent Variable", 
       x = "RMSE", y = "Variable Name") +
  scale_x_continuous(breaks = round(seq(min(RRMSE_df$RMSE), max(RRMSE_df$RMSE), by = 0.2), 1)) +
  geom_point() + 
  theme(text = element_text(size = 20))

We are looking for the variable with the lowest RMSE and from this graph we can see that Outstate has the lowest RMSE at about 13.7. The RMSE value on its own means nothing, it needs to be compared to the RMSE's of the other models, which is why I graphed the other RMSE's of the other models as well. This is what allows us to determine that Outstate has the lowest RMSE. Ideally we are looking for the model with the highest R^2 and the lowest RMSE, which right now appears to be Grad.Rate ~ Outstate.

Next Steps

With my current analysis so far, this is not a good model in predicting the graduation rate of universities and my conclusion would be that these variables are not good predictors of whether or not a student graduates. My next steps are to start combining multiple variables into models to see if I can improve those scores before moving on to other machine learning algorithms for further analysis.

The next step is start making models to find the significant variables and hopefully create better models as a result of what we know so far. I have created 4 linear models below and we will look at each of them to see the process i went through to arrive at the one i have selected.

Linear Modeling and Comparison

Model 1: All Variables

In [76]:
clean_lm1 <- lm(Grad.Rate ~ ., data = clean_data)

# Summary of the model
summary(clean_lm1)
Call:
lm(formula = Grad.Rate ~ ., data = clean_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-45.350  -7.248  -0.048   6.746  48.392 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 32.4499925  6.3335292   5.124 3.93e-07 ***
Private      4.8389269  1.8386518   2.632  0.00869 ** 
Apps         0.0020786  0.0007044   2.951  0.00328 ** 
Accept      -0.0012068  0.0013795  -0.875  0.38199    
Enroll       0.0038222  0.0032111   1.190  0.23435    
Top10perc    0.0095701  0.0823620   0.116  0.90753    
Top25perc    0.1909495  0.0603809   3.162  0.00164 ** 
F.Undergrad -0.0005814  0.0005513  -1.055  0.29198    
P.Undergrad -0.0015501  0.0007821  -1.982  0.04790 *  
Outstate     0.0012136  0.0002558   4.744 2.56e-06 ***
Room.Board   0.0020911  0.0006336   3.300  0.00102 ** 
Books       -0.0046632  0.0040718  -1.145  0.25252    
Personal    -0.0022758  0.0009267  -2.456  0.01431 *  
PhD          0.1403826  0.0645453   2.175  0.02998 *  
Terminal    -0.0943899  0.0686826  -1.374  0.16981    
S.F.Ratio   -0.1393575  0.1960667  -0.711  0.47748    
perc.alumni  0.3017563  0.0510849   5.907 5.55e-09 ***
Expend      -0.0014083  0.0002755  -5.112 4.16e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.27 on 667 degrees of freedom
Multiple R-squared:  0.4506,	Adjusted R-squared:  0.4366 
F-statistic: 32.18 on 17 and 667 DF,  p-value: < 2.2e-16

We are looking for all of the variables with a p-value less than 0.05, so make it easier to see I have filtered all the other variables that do not fit that criteria.

Significant Variables (P-Values)

This list shows us all the variables that are significant in this model and thus in the next model we will only use these variables and then repeat the process until we have a desired model.

In [78]:
p_values <- data.frame(cbind("P.value" = summary(clean_lm1)$coef[summary(clean_lm1)$coef[,4] <= .05, 4]))

# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
Warning message in xtfrm.data.frame(x):
"cannot xtfrm data frames"
A data.frame: 11 × 1
P.value
<dbl>
perc.alumni5.554407e-09
(Intercept)3.931069e-07
Expend4.163214e-07
Outstate2.564778e-06
Room.Board1.017457e-03
Top25perc1.635636e-03
Apps3.277498e-03
Private8.690303e-03
Personal1.430683e-02
PhD2.998477e-02
P.Undergrad4.790074e-02

Model 2: Significant Variables from Model 1

In [79]:
clean_lm2 <- lm(Grad.Rate ~ perc.alumni + Expend + Outstate + Room.Board + Top25perc + 
                Apps + Private + Personal + PhD + P.Undergrad, clean_data)
    
summary(clean_lm2)
Call:
lm(formula = Grad.Rate ~ perc.alumni + Expend + Outstate + Room.Board + 
    Top25perc + Apps + Private + Personal + PhD + P.Undergrad, 
    data = clean_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.609  -7.503  -0.212   6.829  51.101 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 25.4947692  4.1787646   6.101 1.78e-09 ***
perc.alumni  0.3048820  0.0503760   6.052 2.37e-09 ***
Expend      -0.0013222  0.0002309  -5.725 1.55e-08 ***
Outstate     0.0011706  0.0002460   4.758 2.40e-06 ***
Room.Board   0.0019596  0.0006084   3.221  0.00134 ** 
Top25perc    0.1968159  0.0339008   5.806 9.88e-09 ***
Apps         0.0015177  0.0002642   5.744 1.40e-08 ***
Private      5.4867688  1.7262886   3.178  0.00155 ** 
Personal    -0.0024475  0.0008903  -2.749  0.00614 ** 
PhD          0.0632297  0.0424775   1.489  0.13707    
P.Undergrad -0.0017891  0.0007068  -2.531  0.01160 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.26 on 674 degrees of freedom
Multiple R-squared:  0.4456,	Adjusted R-squared:  0.4374 
F-statistic: 54.17 on 10 and 674 DF,  p-value: < 2.2e-16

Significant Variables (P-Values)

In [80]:
p_values <- data.frame(cbind("P.value" = summary(clean_lm2)$coef[summary(clean_lm2)$coef[,4] <= .05, 4]))

# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
Warning message in xtfrm.data.frame(x):
"cannot xtfrm data frames"
A data.frame: 10 × 1
P.value
<dbl>
(Intercept)1.776753e-09
perc.alumni2.372000e-09
Top25perc9.875857e-09
Apps1.400662e-08
Expend1.553428e-08
Outstate2.395224e-06
Room.Board1.338150e-03
Private1.548833e-03
Personal6.136838e-03
P.Undergrad1.159664e-02

We can see from the new p-value list from this model that PhD is no longer significant. As a result, we can drop it from the next model. We can check the improvement of each of these models by using anova.

Model Improvement (1 and 2)

In [81]:
anova(clean_lm1, clean_lm2)
A anova: 2 × 6
Res.DfRSSDfSum of SqFPr(>F)
<dbl><dbl><dbl><dbl><dbl><dbl>
1667100358.8NA NA NA NA
2674101270.3-7-911.48090.8654060.5335678

Ideally we want the Pr(>F) value to be less than 0.05, so while the model improved it did not improve enough to reject the null hypothesis and be significant.

Model 3: Significant Variables from Model 2

In [82]:
clean_lm3 <- lm(Grad.Rate ~ perc.alumni + Expend + Outstate + Room.Board + 
                Top25perc + Apps + Private + Personal + P.Undergrad, clean_data)
    
summary(clean_lm3)
Call:
lm(formula = Grad.Rate ~ perc.alumni + Expend + Outstate + Room.Board + 
    Top25perc + Apps + Private + Personal + P.Undergrad, data = clean_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.422  -7.732  -0.354   6.807  50.036 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 28.9019158  3.4992836   8.259 7.73e-16 ***
perc.alumni  0.3095325  0.0503242   6.151 1.32e-09 ***
Expend      -0.0012584  0.0002271  -5.541 4.32e-08 ***
Outstate     0.0012215  0.0002439   5.009 6.99e-07 ***
Room.Board   0.0020489  0.0006059   3.381 0.000763 ***
Top25perc    0.2103792  0.0326827   6.437 2.31e-10 ***
Apps         0.0015451  0.0002638   5.857 7.38e-09 ***
Private      4.7486236  1.6550216   2.869 0.004243 ** 
Personal    -0.0024522  0.0008911  -2.752 0.006084 ** 
P.Undergrad -0.0016953  0.0007047  -2.406 0.016399 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.27 on 675 degrees of freedom
Multiple R-squared:  0.4438,	Adjusted R-squared:  0.4363 
F-statistic: 59.84 on 9 and 675 DF,  p-value: < 2.2e-16

Significant Variables (P-Values)

In [83]:
p_values <- data.frame(cbind("P.value" = summary(clean_lm3)$coef[summary(clean_lm3)$coef[,4] <= .05, 4]))
    
# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
Warning message in xtfrm.data.frame(x):
"cannot xtfrm data frames"
A data.frame: 10 × 1
P.value
<dbl>
(Intercept)7.727327e-16
Top25perc2.311793e-10
perc.alumni1.320651e-09
Apps7.382955e-09
Expend4.324986e-08
Outstate6.994819e-07
Room.Board7.628753e-04
Private4.243466e-03
Personal6.083525e-03
P.Undergrad1.639913e-02

Model Improvement (1, 2, and 3)

In [84]:
anova(clean_lm1, clean_lm2, clean_lm3)
A anova: 3 × 6
Res.DfRSSDfSum of SqFPr(>F)
<dbl><dbl><dbl><dbl><dbl><dbl>
1667100358.8NA NA NA NA
2674101270.3-7-911.48090.8654060.5335678
3675101603.2-1-332.92532.2126730.1373539

From the anova() function, the third model has improved a lot from the second model but the Pr(>F) value is still too low to reject the null hypothesis. So going back to the third models p-value list, we will drop the variables with the biggest p-value.

To speed things up i will be dropping the Room.Board, Personal, and P.Undergrad variables.

Model 4: Significant Variables from Model 3

In [85]:
clean_lm4 <- lm(Grad.Rate ~ perc.alumni + Expend + Outstate + Top25perc + Apps + Private, clean_data)

summary(clean_lm4)
Call:
lm(formula = Grad.Rate ~ perc.alumni + Expend + Outstate + Top25perc + 
    Apps + Private, data = clean_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.347  -8.083  -0.481   7.369  50.997 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 25.6802463  2.7000337   9.511  < 2e-16 ***
perc.alumni  0.3314103  0.0497241   6.665 5.49e-11 ***
Expend      -0.0012029  0.0002272  -5.295 1.61e-07 ***
Outstate     0.0016514  0.0002282   7.237 1.24e-12 ***
Top25perc    0.2100151  0.0329672   6.370 3.48e-10 ***
Apps         0.0014315  0.0002529   5.661 2.22e-08 ***
Private      6.2598739  1.6331208   3.833 0.000138 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.47 on 678 degrees of freedom
Multiple R-squared:  0.4232,	Adjusted R-squared:  0.418 
F-statistic: 82.89 on 6 and 678 DF,  p-value: < 2.2e-16

Significant Variables (P-Values)

In [86]:
p_values <- data.frame(cbind("P.value" = summary(clean_lm4)$coef[summary(clean_lm4)$coef[,4] <= .05, 4]))
    
# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
Warning message in xtfrm.data.frame(x):
"cannot xtfrm data frames"
A data.frame: 7 × 1
P.value
<dbl>
(Intercept)3.215170e-20
Outstate1.242399e-12
perc.alumni5.485671e-11
Top25perc3.479171e-10
Apps2.219454e-08
Expend1.611701e-07
Private1.383241e-04

Model Improvement (1, 2, 3, and 4)

In [87]:
anova(clean_lm1, clean_lm2, clean_lm3, clean_lm4)
A anova: 4 × 6
Res.DfRSSDfSum of SqFPr(>F)
<dbl><dbl><dbl><dbl><dbl><dbl>
1667100358.8NA NA NA NA
2674101270.3-7 -911.48090.8654065.335678e-01
3675101603.2-1 -332.92532.2126731.373539e-01
4678105368.7-3-3765.45398.3419291.884434e-05

We can see from the anova() function that the fourth model has improved significantly from the previous models, with a Pr(>F) value way below 0.05. With this value we can say that the model rejects the null hypothesis. Based on all the information i have been able to deduce from my data, this is the best model i can come up with and will be used for the remainder of the analysis.

R^2 and RMSE all Four Models

In [106]:
options(repr.plot.width=16, repr.plot.height=10)

models <- paste0("clean_lm", 1:4)
    
models_df <- as.data.frame(cbind("Name" = as.character(models), 
                                 "R^2" = sapply(models, function(x)round(summary(get(x))$r.squared, 4)), 
                                 "RMSE" = sapply(models, function(x)round(RMSE(get(x)$residuals), 4))))

# Reorder R^2 column from least to greatest value
models_df <- models_df[order(models_df$`R^2`), drop = FALSE, ]

# Apply factor levels to prevent ggplot from reordering variable values
models_df$Name <- factor(models_df$Name, levels=unique(as.character(models_df$Name)))
models_df <- transform(models_df, Name=reorder(Name, as.numeric(`R^2`)))

rsquared_Plot <- ggplot(models_df, aes(R.2, Name)) + 
  labs(title = "R^2 of each of the Four models", 
       x = "R^2", y = "Models") +
  geom_point(size=3) + 
  theme(text = element_text(size = 20))

# Reorder RMSE column from least to greatest value (dont eval this)
models_df <- models_df[order(models_df$RMSE, decreasing = TRUE), drop = FALSE, ]

# Apply factor levels to prevent ggplot from reordering variable values
models_df$Name <- factor(models_df$Name, levels=unique(as.character(models_df$Name)))
models_df <- transform(models_df, Name=reorder(Name, as.numeric(RMSE)))

# RMSE calculation graph from model$residuals
RMSE_Plot <- ggplot(models_df, aes(RMSE, Name)) + 
  labs(title = "RMSE of each of the Four models", 
       x = "RMSE", y = "Models") +
  geom_point(size=3) + 
  theme(text = element_text(size = 20))

grid.arrange(rsquared_Plot, RMSE_Plot)

While the fourth model does not have the largest R^2 or the smallest RMSE compared to the other models, the anova function tells us that it is the model with the best improvement from the previous models.

We can also take a look at the models in these graph views below, the main one we want to look at is the Normal Q-Q plot in the top right corner. We want the grouping of points to be as close to the dotted line as possible.

Linear Analysis Plots of all Models

In [125]:
options(repr.plot.width=16, repr.plot.height=10)

# Plotting Function for plotting all of the linear models graphs
plot_func <- function(x){
  par(mfrow=c(2,2))
  plot(get(x))
  title(x, line = -2, outer = TRUE)
}

# Plots the models
sapply(models, plot_func) + theme(text = element_text(size = 20))
NULL

As you look through the analysis graphs, the point grouping on Normal Q-Q plot gets closer and closer to the line, with the last model (clean_lm4) having the points closest to the line. All of these graphs still have skew on right hand tail but the fourth model is the best fit of them all.

For comparisons sake, i will do the same tests with the first model (using all variables) to compare the result of the fourth model.

Split Dataset into Train and Test

Now that I have the models that i want to use, i will be using some machine learning algorithms to start training and testing my data predictions and find the accuracy of my models.

First i will find the accuracy of my linear models predictions on my data and then i will move onto using randomforest and K Nearest Neighbor (knn) models.

In [127]:
# The models I will be using
model1 <- (Grad.Rate ~ .)
model4 <- (Grad.Rate ~ perc.alumni + Expend + Outstate + Top25perc + Apps + Private)

# First split the data into train and tests sets, 
# train will have 80% and test will have 20% of the data
in_train <- createDataPartition(clean_data$Grad.Rate, p = 0.8, list = FALSE)

# Create train and test datasets
train <- clean_data[in_train,]
test <- clean_data[-in_train,]

# Convert the Grad.Rate to a factor
test$Grad.Rate <- as.factor(test$Grad.Rate)

Linear Model Performance

In [131]:
# Train our first model, lm1
clean_lm1 <- train(model1, method = "lm", data = train)

# Train our fourth model, lm4
clean_lm4 <- train(model4, method = "lm", data = train)

# Add RMSE's and R^2's to dataframe
RMSE_data <- rbind("clean_lm1" = cbind(clean_lm1$results["RMSE"], clean_lm1$results["Rsquared"]), 
                   "clean_lm4" = cbind(clean_lm4$results["RMSE"], clean_lm4$results["Rsquared"]))

Model 1: All Variables

In [132]:
# clean_lm1's Output
print(clean_lm1)
Linear Regression 

549 samples
 17 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 549, 549, 549, 549, 549, 549, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  12.55749  0.4309828  9.625618

Tuning parameter 'intercept' was held constant at a value of TRUE

Model 4: Significant Variables

In [130]:
# clean_lm4's Output
print(clean_lm4)
Linear Regression 

549 samples
  6 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 549, 549, 549, 549, 549, 549, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  12.35196  0.4366988  9.540643

Tuning parameter 'intercept' was held constant at a value of TRUE

From the result of just linear models so far, it looks like the fourth model faired just a bit better than the first model with all of the variables, which is good because it means that the variables I removed from the first model did in fact improve the fourth model.

From each of the models output, we are looking at the RMSE and R^2 values. The RMSE is how close on average the predicted value was to the actual value, so the closer this value is to 0 the better. The bigger the R^2 value the better, with 1 being the best because it means that the independent variable explains more of the variance in the model from the dependent variables. We will need to complete all the other models and compare all the value together before we can draw any conclusions from their results.

RandomForest Model Performance

In [133]:
# Train first reandomForest model
clean_rf1 <- train(model1, method = "rf", data = train, trcontrol = trainControl(method = "none"))

# Train fourth randomForest model
clean_rf4 <- train(model4, method = "rf", data = train, trcontrol = trainControl(method = "none"))

# Add RMSE's and R^2's to dataframe
RMSE_data <- rbind(RMSE_data, 
                   "clean_rf1" = cbind("RMSE" = clean_rf1$results["RMSE"][[1]][1], 
                                       "Rsquared" = clean_rf1$results["Rsquared"][[1]][1]), 
                   "clean_rf4" = cbind("RMSE" = clean_rf4$results["RMSE"][[1]][1], 
                                       "Rsquared" = clean_rf4$results["Rsquared"][[1]][1]))

Model 1: All Variables

In [134]:
# clean_rf1's Output
print(clean_rf1)
Random Forest 

549 samples
 17 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 549, 549, 549, 549, 549, 549, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE     
   2    12.57814  0.4493274  9.507977
   9    12.57032  0.4416994  9.469833
  17    12.73742  0.4274892  9.601804

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 9.

Model 4: Significant Variables

In [135]:
# clean_rf4's Output
print(clean_rf4)
Random Forest 

549 samples
  6 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 549, 549, 549, 549, 549, 549, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE      
  2     12.84817  0.4142021   9.694953
  4     13.08392  0.3982994   9.892670
  6     13.30018  0.3836767  10.054632

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.

The randomForest model works by tuning the trees to get the best RMSE and R^2. From the model output we see three records under the sample results. Out of all of the results, the first record has the best RMSE and R^2 so it will be recorded and used to measure its effectiveness against the other models.

K Nearest Neighbors Model Performance (Knn)

In [136]:
# Train the first knn model
clean_knn1 <- train(model1, method = "knn", data = train, trcontrol = trainControl(method = "none"))

# Train the fourth knn model
clean_knn4 <- train(model4, method = "knn", data = train, trcontrol = trainControl(method = "none"))

# Add RMSE's and R^2's to dataframe
RMSE_data <- rbind(RMSE_data, 
                   "clean_knn1" = cbind("RMSE" = clean_knn1$results["RMSE"][[1]][1], 
                                       "Rsquared" = clean_knn1$results["Rsquared"][[1]][1]), 
                   "clean_knn4" = cbind("RMSE" = clean_knn4$results["RMSE"][[1]][1], 
                                       "Rsquared" = clean_knn4$results["Rsquared"][[1]][1]))

Model 1: All Variables

In [137]:
# clean_knn1's Output
print(clean_knn1)
k-Nearest Neighbors 

549 samples
 17 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 549, 549, 549, 549, 549, 549, ... 
Resampling results across tuning parameters:

  k  RMSE      Rsquared   MAE     
  5  14.84578  0.2677652  11.46806
  7  14.47124  0.2863643  11.14156
  9  14.23814  0.3004641  10.93290

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 9.

Model 4: Significant Variables

In [138]:
# clean_knn4's Output
print(clean_knn4)
k-Nearest Neighbors 

549 samples
  6 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 549, 549, 549, 549, 549, 549, ... 
Resampling results across tuning parameters:

  k  RMSE      Rsquared   MAE     
  5  15.28477  0.2378169  11.79466
  7  14.71887  0.2640440  11.32470
  9  14.35938  0.2853791  11.05542

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 9.

The knn model works by finding which number of clusters is best for the data to get the best RMSE and R^2. From the model output we see three records under the sample results. Out of all of the results, the third record has the best RMSE and R^2 so it will be recorded and used to measure its effectiveness against the other models.

With the analysis finished we can move on to answering the research questions and draw final conclusions about the analysis of this dataset.

Part 3: Research Question Answers

Which model (linear, logistic, decision trees, etc.) is the most accurate to predict the graduation rate given the variables mentioned in the introduction?

In [145]:
options(repr.plot.width=16, repr.plot.height=6)
par(mfrow=c(1,1))
    
for (i in c(1:ncol(RMSE_data))){
  plot(RMSE_data[[i]], main = paste(colnames(RMSE_data[i]), " of all of models", sep=""), 
       xaxt = "n", ylab = colnames(RMSE_data[i]), 
       xlab = "Models", col="lightblue", pch=19, cex=6)

  axis(1, at=1:6, labels=rownames(RMSE_data))

  abline(v=c(2.5, 4.5))

  # Add values to points for easier viewing
  text(RMSE_data[[i]], labels=round(RMSE_data[[i]], 3), cex=0.8, font=2)
}

Overall, looking at the graphs above, the model with the lowest RMSE and the highest R^2 is the 1st randomforest model (clean_rf1). The randomforest models faired about the same (just slightly better) as the linear models and curiously, the knn models faired far worse than both the linear and randomForest models. However, the R^2 is still very low and the RMSE means that model is only able to predict the Grad Rate of a University from within about the RMSE margin. In terms of a graduation rate, that is quite a large margin and as a result, this should probably not be used to predict a universities' graduation rate.

Which variables have the biggest significant impact on the graduation rate of a university?

First Model P-Values

In [142]:
# First linear model with all variables
p_values <- data.frame(cbind("P.value" = summary(clean_lm1)$coef[summary(clean_lm1)$coef[,4] <= .05, 4]))

# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
Warning message in xtfrm.data.frame(x):
"cannot xtfrm data frames"
A data.frame: 10 × 1
P.value
<dbl>
Outstate9.795743e-07
(Intercept)1.833433e-06
perc.alumni1.924528e-06
Expend8.915036e-06
Top25perc6.898967e-04
Room.Board7.225534e-04
Personal3.480108e-03
Apps8.247470e-03
P.Undergrad2.600143e-02
Private4.456978e-02

Fourth Model P-Values

In [143]:
# Fourth linear model
p_values <- data.frame(cbind("P.value" = summary(clean_lm4)$coef[summary(clean_lm4)$coef[,4] <= .05, 4]))

# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
Warning message in xtfrm.data.frame(x):
"cannot xtfrm data frames"
A data.frame: 7 × 1
P.value
<dbl>
(Intercept)1.169547e-16
Outstate5.567367e-14
Top25perc3.865583e-09
perc.alumni1.705947e-08
Expend1.464758e-06
Apps2.356472e-06
Private2.837982e-03

From the above lists we can see that for the first model the variables: perc.alumni, Expend, Outstate, Top25perc, Private, Personal, Room.Board, and Apps were all significant.

Since the fourth model performed better than the first one, i am inclined to say that the variables listed as significant in the fourth model have a bigger impact on the data than those in the first model: perc.alumni, Top25perc, Outstate, Apps, Expend, and Private.

What is the distribution of universities with graduation rates higher than 30%, 50%, 70%, 90%?

In [146]:
options(repr.plot.width=16, repr.plot.height=10)
par(mfrow=c(2,2))
    
for(i in c(30, 50, 70, 90)){
  rate <- clean_data %>% rownames_to_column('gene') %>%
    filter(Grad.Rate >= i) %>%
    column_to_rownames('gene')

  tempHist <- hist(rate$Grad.Rate, main = paste("Grad.Rate > ", i, "% (Total: ", count(rate), ")",sep=""),
                   xlab = "Graduation Rate (%)")

  text(tempHist$mids,tempHist$counts, labels=tempHist$counts, adj=c(0.5, -0.5))
}

Does an increase in professors with PhD’s have an effect on the student’s graduation rate?

In [163]:
# Copy dataframe
newPred <- clean_data

# Change the number of teachers with PhD's to 120 for all universities
newPred$PhD <- 120

# Make predictions using clean_lm4 because it had the best RMSE and R^2
pred <- as.data.frame(cbind("Grad.Rate" = round(predict(clean_lm4, newdata = newPred), 0)))

# Total Universities
print(paste0('Total Universities: ', sum(table(pred$Grad.Rate > clean_data$Grad.Rate))))
[1] "Total Universities: 685"
In [164]:
# Comparison of the grad rates
print('Comparision of Grad Rates')
table(pred$Grad.Rate > clean_data$Grad.Rate)
[1] "Comparision of Grad Rates"
FALSE  TRUE 
  339   346 

From the comparison we can see that there is a pretty even split in the data for which universities saw an increase or a decrease in their graduation rate, however, it is important to note that PhD had a very small impact on the models during our analysis and when running the same prediction with the clean dataset, we get the same results. As a result with this model, i am leaning towards that the number of teachers with PhD's does not have significant influence on the graduation rate.

Does a university being private or public have an effect on student graduation rate?

In [167]:
# Get list of private universities for prediction and clean datasets
privated <- clean_data %>% filter(Private == 2)
clean_priv <- clean_data %>% filter(Private == 2)

# Get list of public universities for prediction and clean datasets
public <- clean_data %>% filter(Private == 1)
clean_publ <- clean_data %>% filter(Private == 1)

# Change private schools to public
privated$Private <- 1

# Change public schools to private
public$Private <- 2

# Make predictions using clean_lm4 because it had the best RMSE and R^2

# Private to Public prediction
pred_Priv <- as.data.frame(cbind("Grad.Rate" = round(predict(clean_lm4, newdata = privated), 0)))

# Public to Private Prediction
pred_Publ <- as.data.frame(cbind("Grad.Rate" = round(predict(clean_lm4, newdata = public), 0)))

# Comparison of grad rates to original dataset

# Total Universities that were originally private
paste0('Total Universities that were originally private: ', sum(table(pred_Priv$Grad.Rate > clean_priv$Grad.Rate)))
'Total Universities that were originally private: 516'
In [168]:
# Private to Public comparison
print('Private to Public Comparison')
table(pred_Priv$Grad.Rate > clean_priv$Grad.Rate)
[1] "Public to Private Comparison"
FALSE  TRUE 
  352   164 
In [169]:
# Total Universities that were originally public
print('Total universities that were originally public')
sum(table(pred_Publ$Grad.Rate > clean_publ$Grad.Rate))
[1] "Total universities that were originally public"
169
In [170]:
# Public to Private comparison
print('Public to Private Comparison')
table(pred_Publ$Grad.Rate > clean_publ$Grad.Rate)
[1] "Public to Private Comparison"
FALSE  TRUE 
   56   113 

From the first comparison (Private to Public) of 516 originally private universities, about 32% of the total saw an increase in their Grad Rate and the rest (about 68%) either saw a decrease or it stayed the same. Based on this, i would say that if a private university decides to become public then it could be assumed that they would experience a decrease in their graduation rate.

From the second comparison (Public to Private) of 169 originally public universities, about 67% of the total saw an increase in their Grad Rate and the rest (about 33%) either saw a decrease or it stayed the same. Based on this, i would say that if a public university decides to become private then it could be assumed that they would experience an increase in their graduation rate.

However, we do still need to keep in mind that this result is based a model with a wide error margin and should not be taken too confidently.

Conclusion

In conclusion, based on all of the analysis, I am confident when I say that this dataset does not have either enough data or the right variables measured to predict the graduation rate of these universities to a confident degree. The models RMSE's and R^2's are not good enough to be able to use the models for graduation rate prediction of the universities. I would recommend that more data be collected and other variables of measure be collected as well to get a better spread of data to be to get more accurate models.

What this Means for the Universities

It can mean that the variables measured in this dataset do not actually have an effect on the graduation rate and thus the university administration does not need to spend more time and/or resources improving upon them to boost their grad rate. However, as i said above, i do believe that there needs to be more data collected and possibly more variable measurements added as well to more confidently agree with that statement.

Limitations and Improvements

This dataset is only limited to one year of data collection and I think it would benefit greatly if multiple years of data from these universities could be used, then a far more accurate model could be trained to predict their graduation rates. I also think that in order to improve this dataset and by extension the models derived from it, is to collect other bits of data about the university as well, such as: the number of students who graduate, the number of students who drop out by their school year (freshman, Sophomore, Junior, Senior) and possibly student demographic data (to get a better idea of the makeup of the student population).

In [ ]: